Quick Start: Gaussian Graphical Models with Multi-Study Factor Analysis (MSFA-X)

Introduction

This quick start guide is a companion to the preprint “Gaussian graphical modeling of multi-study data with Multi-Study Factor Analysis”(Shutta, Scholtens, et al., 2022).

Installing and loading libraries

The first step is to install and load MSFA-X, which is contained in the extended MSFA R package from the fork available at https://github.com/katehoffshutta/MSFA. We will use the devtools package to do this, so you will need to install devtools first if you don’t have it already.

library(devtools)
install_github("katehoffshutta/MSFA")
library(MSFA)

There are several other packages that we will use downstream to analyze the model results. Install them prior to running this chunk.

library(corrplot)
library(dplyr)
library(forcats)
library(ggplot2)
library(igraph)

Loading and filtering the data

Next, we will load the breast cancer dataset used by (De Vito et al., 2021). This dataset consists of measurements for the same 6362 genes across seven different breast cancer samples. For more details, run ?data_breastCancer05.

data(data_breastCancer05)
length(data_breastCancer05)
## [1] 7
sapply(data_breastCancer05,dim)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  118  353  200   99  198  133  344
## [2,] 6362 6362 6362 6362 6362 6362 6362

The capacity to estimate GGMs is currently built on the frequentist version of MSFA (De Vito et al., 2019). This does not support high-dimensional data, so it is necessary to first reduce the number of predictors to smaller than the sample size. An area of future work is to extend the Bayesian version of MSFA (De Vito et al., 2021) in a similar way; this would enable graphical modeling for high-dimensional data. In the meantime, for illustration of the low-dimensional cases, we select the 1 percent genes with highest average variability across the studies.

geneVars = sapply(data_breastCancer05,function(x){apply(x,2,var)})
avgGeneVars = apply(geneVars,1,mean)
topVarIdx = which(avgGeneVars > quantile(avgGeneVars,0.99))
topVarGenes = names(avgGeneVars)[topVarIdx]

Next, we filter the breast cancer datasets to only these genes and verify that all datasets are now low-dimensional.

X_s = lapply(data_breastCancer05,function(x){x[,colnames(x) %in% topVarGenes]})
sapply(X_s,dim)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  118  353  200   99  198  133  344
## [2,]   64   64   64   64   64   64   64

Determining the number of factors

Before running MSFA, it is necessary to estimate the number of shared and study-specific factors. We provide the function get_factor_count for this purpose. This process takes about 10 minutes to run on a standard Macbook Pro, so we have commented it out here and load the object instead.

# nFactors = get_factor_count(X_s, method="cng")
# save(nFactors,file="nFactors.rda")
load("nFactors.rda")

The result contains the estimated number of shared factors:

nFactors$k
## [1] 2

and the estimated number of study-specific factors:

nFactors$j_s
## [1] 1 1 1 1 1 1 1

We use these numbers as input to MSFA-X. The first step is to estimate a starting point for the expectation-conditional maximization (ECM) algorithm used in MSFA, applying the start_msfa function from the original MSFA code.

start_point = start_msfa(X_s = X_s, k = nFactors$k, j_s = nFactors$j_s)

Running MSFA-X

Next, we run the extended MSFA algorithm (MSFA-X). This algorithm will first run the ECM algorithm described in (De Vito et al., 2019). Next, it will estimate the graphical model parameters post hoc. The argument extend = TRUE is what tells MSFA to perform this second step.

msfax_model = ecm_msfa(X_s = X_s, start = start_point, extend = TRUE)
## i= 1000 Criterion for convergence  0.001324233 
## i= 2000 Criterion for convergence  0.0005082112 
## i= 3000 Criterion for convergence  0.001825515 
## i= 4000 Criterion for convergence  0.0004093674 
## i= 5000 Criterion for convergence  9.111513e-05 
## i= 6000 Criterion for convergence  2.034439e-05 
## i= 7000 Criterion for convergence  4.555826e-06 
## i= 8000 Criterion for convergence  1.021952e-06 
## i= 9000 Criterion for convergence  2.294291e-07

We can visually investigate the parameters of the MSFA model using the corrplot package. We see that the two shared loadings Phi1 and Phi2 load on different sets of genes and that the study-specific loadings Lambda*1 are generally somewhat similar, with the exception of Study 1.

allFactors = cbind(msfax_model$Phi, Reduce(cbind, msfax_model$Lambda_s))
rownames(allFactors) = colnames(X_s[[1]])
colnames(allFactors) = c("Phi1","Phi2","Lambda11","Lambda21","Lambda31","Lambda41","Lambda51","Lambda61","Lambda71")
corrplot(allFactors,is.corr=F)

Creating graphs from MSFA-X output

Next, we extract the adjacency matrices defining the GGMs estimated by MSFA-X. Shared and study-specific precision matrices are converted to GGMs using the relationship between the \(i,j\) entry of the precision matrix \(\Theta\) and the partial correlation \(\rho_{ij}\) (Shutta, De Vito, et al., 2022):

\[ \rho_{ij} = -\frac{\Theta_{ij}}{\sqrt{\Theta_{ii}\Theta_{jj}}} \]

This can be done efficiently using the cov2cor function in R:

shared_adjmat = -cov2cor(msfax_model$SharedPrec)
colnames(shared_adjmat) = colnames(msfax_model$X_s[[1]])
study_adjmats = lapply(msfax_model$StudyPrec,function(x){adjmat = -cov2cor(x); colnames(adjmat) = colnames(msfax_model$X_s[[1]]); adjmat})

Next, we will use the igraph R package to convert these matrices into graph objects:

shared_graph = graph_from_adjacency_matrix(shared_adjmat,mode="undirected",diag=F,weighted=T)
study_graphs = lapply(study_adjmats, graph_from_adjacency_matrix,mode="undirected",diag=F,weighted=T)

We can visualize these graphs by using the plot function of igraph. There are many layout options (see ?plot.igraph), but the easiest one for comparing graphs is the circular layout because it preserves the ordering of the nodes in every case.

msfaxPlot = function(my_graph, my_title="")
{
  plot(my_graph,
       vertex.size=2,
       vertex.color="white",
       vertex.label.color="black",
       edge.width = 5*abs(E(my_graph)$weight),
       edge.color = ifelse(E(my_graph)$weight > 0, "red","blue"),
       layout = layout_in_circle)
  title(my_title)
}

msfaxPlot(shared_graph, my_title="Shared")

for(i in 1:7)
  msfaxPlot(study_graphs[[i]], my_title=paste("Study",i))

Study-specific signal: mammaglobin A and lipophilin B in Study 1

It is evident from these plots that there is some variation across studies; for example, in Study 1 we see a strong positive partial correlation between SCGB1D2 and SCGB2A2. These genes encode the proteins mammaglobin A (SCGB2A2) and lipophilin B (SCGB1D2), which form a complex in breast cancer tissue and are associated with ER-positive breast cancer (Zafrakas et al., 2006). Table 1 of (De Vito et al., 2021) shows that chemotherapy and hormone therapy were used as adjuvant therapy in this study. This treatment is distinct from the other six studies, and the observed signal in the Study 1 graph may be indicative of a treatment-specific gene expression signature.

Shared signal: weighted absolute degree and ESR1

We can also investigate the shared graph for signals common across studies, as done in (De Vito et al., 2021). One way to do this is to investigate the importance of individual genes with the strength function of igraph, which calculates a weighted degree. We calculate this weighted degree using absolute edge values, as we consider both positive and negative edges to be important.

shared_degree = strength(shared_graph,weights=abs(E(shared_graph)$weight))
shared_degree[order(-shared_degree)]
##       ESR1    IGKV1-5       TFF3       AGR2   IGLV2-23    IGKV4-1    SCGB2A2 
## 2.05849929 2.00599786 1.62891460 1.56178622 1.55098998 1.47838718 1.45019495 
##      KRT17      KRT16       TFF1       NAT1    SCGB1D2      GABRP   SERPINB5 
## 1.43784990 1.42958170 1.41718963 1.34959053 1.34302117 1.31144282 1.25449434 
##     SCUBE2       IGHM     CYP2B6     CHI3L1      GREB1      KRT14        IGJ 
## 1.22900117 1.06221137 0.99172251 0.98992937 0.98511148 0.97990590 0.96760262 
##      PROM1       TOX3   CYP2B7P1      KRT81       ZIC1      FABP7        PIP 
## 0.92721699 0.85885473 0.84154566 0.79032778 0.78161567 0.75078202 0.68266502 
##     CRABP1     S100A8     CYP4B1      PDZK1      KRT15     S100A9     CXCL13 
## 0.61321340 0.58803647 0.58373814 0.58281595 0.58086658 0.50663251 0.48169505 
##       MMP1      NPY1R       CPB1    SCGB2A1     CALML5     HMGCS2      DHRS2 
## 0.47382235 0.47028485 0.46617781 0.45242153 0.44277217 0.42632570 0.40892904 
##    CEACAM6      OLFM4      GRIA2     TFAP2B       MSMB       AREG     MAGEA3 
## 0.40858839 0.37170604 0.36565013 0.32423894 0.31346172 0.28820488 0.26777883 
##     S100A7     CYP4F8      FABP4    UGT2B28        LTF      CLCA2   HLA-DQB1 
## 0.26702988 0.26467523 0.24681856 0.22818010 0.19626073 0.13875852 0.12704958 
##   HLA-DRB4      MUC5B   HLA-DQA1      GRB14     CRISP3       PSPH       MAOB 
## 0.12080168 0.11244895 0.11100057 0.08936718 0.08825585 0.05892442 0.05303473 
##      S100P 
## 0.02541553

We can visualize this node importance with a lollipop plot.

deg_df = data.frame("gene" = names(shared_degree),
                    "weighted_abs_degree"=shared_degree)

deg_df %>% 
  mutate(gene = fct_reorder(gene,weighted_abs_degree)) %>%
  ggplot(aes(x=gene, y=weighted_abs_degree)) +
  geom_segment(aes(x=gene, xend=gene, y=0, yend=weighted_abs_degree), color="grey") +
  geom_point( color="blue", size=4, alpha=0.6) +
  coord_flip() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank()
  )

According to this measure, the ESR1 gene is the most important in the shared network. This recapitulates the importance of ESR1 described in (De Vito et al., 2021).

References

De Vito, R., Bellio, R., Trippa, L., & Parmigiani, G. (2019). Multi-study factor analysis. Biometrics, 75(1), 337–346.
De Vito, R., Bellio, R., Trippa, L., & Parmigiani, G. (2021). Bayesian multistudy factor analysis for high-throughput biological data. The Annals of Applied Statistics, 15(4), 1723–1741.
Shutta, K. H., De Vito, R., Scholtens, D. M., & Balasubramanian, R. (2022). Gaussian graphical models with applications to omics analyses. Statistics in Medicine, 41(25), 5150–5187.
Shutta, K. H., Scholtens, D. M., Lowe Jr, W. L., Balasubramanian, R., & De Vito, R. (2022). Estimating gaussian graphical models of multi-study data with multi-study factor analysis. arXiv Preprint arXiv:2210.12837.
Zafrakas, M., Petschke, B., Donner, A., Fritzsche, F., Kristiansen, G., Knüchel, R., & Dahl, E. (2006). Expression analysis of mammaglobin a (SCGB2A2) and lipophilin b (SCGB1D2) in more than 300 human tumors and matching normal tissues reveals their co-expression in gynecologic malignancies. BMC Cancer, 6, 1–13.